Genome sequencing in microfabricated 
high-density picolitre reactors 
The proliferation of large-scale DNA-sequencing projects in recent years has driven a search for alternative methods to 
reduce time and cost. Here we describe a scalable, highly parallel sequencing system with raw throughput significantly 
greater than that of state-of-the-art capillary electrophoresis instruments. The apparatus uses a novel fibre-optic slide of 
individual wells and is able to sequence 25 million bases, at 99% or better accuracy, in one four-hour run. To achieve an 
approximately 100-fold increase in throughput over current Sanger sequencing technology, we have developed an 
emulsion method for DNA amplification and an instrument for sequencing by synthesis using a pyrosequencing protocol 
optimized for solid support and picolitre-scale volumes. Here we show the utility, throughput, accuracy and robustness 
of this system by shotgun sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage 
at 99.96% accuracy in one run of the machine. 



DNA sequencing has markedly changed the nature of biomedical 
research and medicine. Reductions in the cost, complexity and time 
required to sequence large amounts of DNA, including improvements 
in the ability to sequence bacterial and eukaryotic genomes, 
will have significant scientific, economic and cultural impact. 
Large-scale sequencing projects, including whole-genome sequencing, have 
usually required the cloning of DNA fragments into bacterial vectors, 
amplification and purification of individual templates, followed by 
Sanger sequencing 1 using fluorescent chain-terminating nucleotide 
analogues 2 and either slab gel or capillary electrophoresis. Current 
estimates put the cost of sequencing a human genome between $10 
million and $25 million 3. Alternative sequencing methods have been 
described 4-8; however, no technology has displaced the use of 
bacterial vectors and Sanger sequencing as the main generators of 
sequence information. 

Here we describe an integrated system whose throughput routinely 
enables applications requiring millions of bases of sequence 
information, including whole-genome sequencing. Our focus has been on 
the co-development of an emulsion-based method 9-11 to isolate and 
amplify DNA fragments in vitro, and of a fabricated substrate and 
instrument that performs pyrophosphate-based sequencing 
(pyrosequencing 5,12) in picolitre-sized wells. 

In a typical run we generate over 25 million bases with a Phred 
quality score of 20 or better (predicted to have an accuracy of 99% or 
higher). Although this Phred 20 quality throughput is significantly 
higher than that of Sanger sequencing by capillary electrophoresis, it 
is currently at the cost of substantially shorter reads and lower 
average individual read accuracy. Sanger-based capillary 
electrophoresis sequencing systems produce up to 700 bases of sequence 
information from each of 96 DNA templates at an average read 
accuracy of 99.4% in 1 h, or 67,000 bases per hour, with substantially 
all of the bases having Phred 20 or better quality. We further 
characterize the performance of the system and demonstrate that it 
is possible to assemble bacterial genomes de novo from relatively 
short reads by sequencing a known bacterial genome, Mycoplasma 
genitalium (580,069 bases), and comparing our shotgun sequencing 
and de novo assembly with the results originally obtained for this 
genome 13. The results of shotgun sequencing and de novo assembly of 
a larger bacterial genome, that of Streptococcus pneumoniae 14 
(2.1 megabases (Mb)), are presented in Supplementary Table 4. 
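
The throughput comparison above can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in the text (up to 700 bases from each of 96 templates in 1 h for capillary electrophoresis, and 25 million bases in a four-hour run for the present system); the function and variable names are illustrative and not part of any instrument software.

```python
# Figures quoted in the text; names are illustrative only.
SANGER_BASES_PER_TEMPLATE = 700
SANGER_TEMPLATES_PER_RUN = 96
SANGER_RUN_HOURS = 1.0

PICOLITRE_BASES_PER_RUN = 25_000_000
PICOLITRE_RUN_HOURS = 4.0


def bases_per_hour(bases_per_run: float, run_hours: float) -> float:
    """Raw throughput in bases per hour for one instrument run."""
    return bases_per_run / run_hours


sanger_rate = bases_per_hour(
    SANGER_BASES_PER_TEMPLATE * SANGER_TEMPLATES_PER_RUN, SANGER_RUN_HOURS
)
picolitre_rate = bases_per_hour(PICOLITRE_BASES_PER_RUN, PICOLITRE_RUN_HOURS)
fold_increase = picolitre_rate / sanger_rate

print(f"Capillary electrophoresis: {sanger_rate:,.0f} bases per hour")
print(f"Picolitre-well system:     {picolitre_rate:,.0f} bases per hour")
print(f"Fold increase:             {fold_increase:.0f}")
```

With these inputs the ratio comes out a little under 100, in line with the approximately 100-fold increase in throughput stated in the abstract.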

Emulsion-based sample preparation 

We generate random libraries of DNA fragments by shearing an 
entire genome and isolating single DNA molecules by limiting 
dilution (Supplementary Methods). Specifically, we randomly 
fragment the entire genome, add specialized common adapters to the 
fragments, capture the individual fragments on their own beads and, 
within the droplets of an emulsion, clonally amplify the individual 
fragment (Fig. 1a, b). Unlike in current sequencing technology, our 
approach does not require subcloning in bacteria or the handling of 
individual clones; the templates are handled in bulk within the 
emulsions 9-11. 

Sequencing in fabricated picolitre-sized reaction vessels 

We perform sequencing by synthesis simultaneously in open wells of 
a fibre-optic slide using a modified pyrosequencing protocol that is 
designed to take advantage of the small scale of the wells. The 
fibre-optic slides are manufactured by slicing of a fibre-optic block that is 
obtained by repeated drawing and fusing of optic fibres. At each 
iteration, the diameters of the individual fibres decrease as they are 
hexagonally packed into bundles of increasing cross-sectional sizes. 
Each fibre-optic core is 44 μm in diameter and surrounded by 
2-3 μm of cladding; etching of each core creates reaction wells 
approximately 55 μm in depth with a centre-to-centre distance of 
50 μm (Fig. 1c), resulting in a calculated well size of 75 pl and a well 
density of 430 wells mm⁻². The slide, containing approximately 1.6 
million wells 15, is loaded with beads and mounted in a flow chamber 
designed to create a 300-μm-high channel, above the well openings, 
through which the sequencing reagents flow (Fig. 2a, b). The 
unetched base of the slide is in optical contact with a second 
fibre-optic imaging bundle bonded to a charge-coupled device (CCD) 
sensor, allowing the capture of emitted photons from the bottom of 
each individual well (Fig. 2c; see also Supplementary Methods). 
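
As a rough cross-check of the well geometry, the sketch below treats each well as an ideal cylinder with the core diameter and etch depth given above, and treats the wells as an ideal hexagonal lattice at the stated centre-to-centre distance. Both idealizations are assumptions made here for illustration; real etched wells need not be perfect cylinders, so only order-of-magnitude agreement with the reported well size and density should be expected.

```python
import math

# Dimensions quoted in the text.
CORE_DIAMETER_UM = 44.0
WELL_DEPTH_UM = 55.0
PITCH_UM = 50.0          # centre-to-centre distance
SLIDE_SIDE_MM = 60.0     # 60 x 60 mm slide


def cylinder_volume_pl(diameter_um: float, depth_um: float) -> float:
    """Volume of an ideal cylindrical well in picolitres (1 pl = 1000 um^3)."""
    volume_um3 = math.pi * (diameter_um / 2.0) ** 2 * depth_um
    return volume_um3 / 1000.0


def hexagonal_density_per_mm2(pitch_um: float) -> float:
    """Sites per mm^2 for an ideal hexagonal lattice with the given pitch."""
    pitch_mm = pitch_um / 1000.0
    return 2.0 / (math.sqrt(3.0) * pitch_mm ** 2)


well_volume_pl = cylinder_volume_pl(CORE_DIAMETER_UM, WELL_DEPTH_UM)
density_per_mm2 = hexagonal_density_per_mm2(PITCH_UM)
wells_on_slide = density_per_mm2 * SLIDE_SIDE_MM ** 2

print(f"Ideal cylinder volume: {well_volume_pl:.1f} pl")
print(f"Ideal lattice density: {density_per_mm2:.0f} wells per mm^2")
print(f"Wells on a full slide: {wells_on_slide:,.0f}")
```

The idealized well count for a full slide is close to the approximately 1.6 million wells reported, and the idealized volume is of the same order as the calculated 75 pl.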

We developed a three-bead system, and optimized the components 
to achieve high efficiency on solid support. The combination of 
picolitre-sized wells, enzyme loading uniformity allowed by the small 
beads and enhanced solid support chemistry enabled us to develop a 
method that extends the useful read length of 
sequencing-by-synthesis to 100 bases (Supplementary Methods). 

In the flow chamber, cyclically delivered reagents flow 
perpendicularly to the wells. This configuration allows simultaneous 
extension reactions on template-carrying beads within the open wells and 
relies on convective and diffusive transport to control the addition or 
removal of reagents and by-products. The timescale for diffusion into 
and out of the wells is on the order of 10 s in the current configuration 
and is dependent on well depth and flow channel height. The 
timescales for the signal-generating enzymatic reactions are on the 
order of 0.02-1.5 s (Supplementary Methods). The current reaction 
is dominated by mass transport effects, and improvements based on 
faster delivery of reagents are possible. Well depth was selected on the 
basis of a number of competing requirements: (1) wells need to be 
deep enough for the DNA-carrying beads to remain in the wells in the 
presence of convective transport past the wells; (2) they must be 
sufficiently deep to provide adequate isolation against diffusion of 
by-products from a well in which incorporation is taking place to a 
well where no incorporation is occurring; and (3) they must be 
shallow enough to allow rapid diffusion of nucleotides into the wells 
and rapid washing out of remaining nucleotides at the end of each 
flow cycle to enable high sequencing throughput and reduced reagent 
use. After the flow of each nucleotide, a wash containing apyrase is 
used to ensure that nucleotides do not remain in any well before the 
next nucleotide is introduced. 
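
The dependence of the diffusion timescale on well depth can be illustrated with the standard one-dimensional estimate t = L²/(2D). This is a minimal sketch only: the diffusivity used below is an assumed order-of-magnitude value for a small nucleotide in water, not a figure from this work, and the estimate ignores the flow channel height and convection that the text also identifies as relevant.

```python
WELL_DEPTH_M = 55e-6                   # well depth quoted in the text
ASSUMED_DIFFUSIVITY_M2_PER_S = 3e-10   # assumption, not from the source


def diffusion_time_s(length_m: float, diffusivity_m2_per_s: float) -> float:
    """One-dimensional diffusion time estimate, t = L^2 / (2 D)."""
    return length_m ** 2 / (2.0 * diffusivity_m2_per_s)


t_well = diffusion_time_s(WELL_DEPTH_M, ASSUMED_DIFFUSIVITY_M2_PER_S)
t_double_depth = diffusion_time_s(2 * WELL_DEPTH_M, ASSUMED_DIFFUSIVITY_M2_PER_S)

print(f"Estimated diffusion time over the well depth: {t_well:.1f} s")
print(f"Same estimate for a well twice as deep:       {t_double_depth:.1f} s")
```

Under this assumption the estimate is a few seconds, the same order as the roughly 10 s quoted, and it grows with the square of the depth, which is one way to see why wells must be shallow enough for rapid reagent exchange.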

Base calling of individual reads 

Nucleotide incorporation is detected by the associated release of 
inorganic pyrophosphate and the generation of photons 5,12. Wells 
containing template-carrying beads are identified by detecting a 
known four-nucleotide 'key' sequence at the beginning of the read 




Figure 1 | Sample preparation. a, Genomic DNA is isolated, fragmented, 
ligated to adapters and separated into single strands (top left). Fragments are 
bound to beads under conditions that favour one fragment per bead, the 
beads are captured in the droplets of a PCR-reaction-mixture-in-oil 
emulsion and PCR amplification occurs within each droplet, resulting in 
beads each carrying ten million copies of a unique DNA template (top right). 
The emulsion is broken, the DNA strands are denatured, and beads carrying 
single-stranded DNA clones are deposited into wells of a fibre-optic slide 
(bottom right). Smaller beads carrying immobilized enzymes required for 
pyrophosphate sequencing are deposited into each well (bottom left). 
b, Microscope photograph of emulsion showing droplets containing a bead 
and empty droplets. The thin arrow points to a 28-μm bead; the thick arrow 
points to an approximately 100-μm droplet. c, Scanning electron 
micrograph of a portion of a fibre-optic slide, showing fibre-optic cladding 
and wells before bead deposition. 




Figure 2 | Sequencing instrument. The sequencing instrument consists of 
the following major subsystems: a fluidic assembly (a), a flow chamber that 
includes the well-containing fibre-optic slide (b), a CCD camera-based 
imaging assembly (c), and a computer that provides the necessary user 
interface and instrument control. 
(Supplementary Methods). Raw signals are background-subtracted, 
normalized and corrected. The normalized signal intensity at each 
nucleotide flow, for a particular well, indicates the number of 
nucleotides, if any, that were incorporated. This linearity in signal 
is preserved to at least homopolymers of length eight 
(Supplementary Fig. 6). In sequencing by synthesis a very small number of 
templates on each bead lose synchronism (that is, either get ahead of, 
or fall behind, all other templates in sequence). The effect is 
primarily due to leftover nucleotides in a well (creating 'carry 
forward') or to incomplete extension. Typically, we observe a carry 
forward rate of 1-2% and an incomplete extension rate of 0.1-0.3%. 
Correction of these shifts is essential because the loss of synchronism 
is a cumulative effect that degrades the quality of sequencing at 
longer read lengths. We have developed algorithms, based on detailed 
models of the underlying physical phenomena, that allow us to 
determine, and correct for, the amounts of carry forward and 
incomplete extension occurring in individual wells (Supplementary 
Methods). Figure 3 shows the processed result, a 113-base-long read 
generated in the M. genitalium run discussed below. To assess 
sequencing performance and the effectiveness of the correction 
algorithms, independently of artefacts introduced during the 
emulsion-based sample preparation, we created test fragments with 
difficult-to-sequence stretches of identical bases of increasing length 
(homopolymers) (Supplementary Methods and Supplementary Fig. 
4). Using these test fragments, we have verified that at the individual 
read level we achieve base call accuracy of approximately 99.4%, at 
read lengths in excess of 100 bases (Table 1). 
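
The relationship between normalized flow signals and called bases can be sketched as follows. This is a deliberately minimal illustration, assuming signals that have already been background-subtracted, normalized and corrected, and using simple rounding to the nearest integer as the calling rule; the actual base-calling and correction algorithms are described in the Supplementary Methods and are not reproduced here. The flow order T, A, C, G is the one given in the Figure 3 legend, and the example signal values are hypothetical.

```python
import math

FLOW_ORDER = "TACG"  # nucleotide flow order given in the Figure 3 legend


def call_bases(signals, flow_order=FLOW_ORDER):
    """Convert normalized flow signals into a base sequence.

    Each signal is rounded to the nearest non-negative integer, which is
    taken as the homopolymer length incorporated during that flow.
    """
    bases = []
    for flow_index, signal in enumerate(signals):
        nucleotide = flow_order[flow_index % len(flow_order)]
        count = max(0, math.floor(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical signals for two cycles of four flows (T, A, C, G, T, A, C, G).
example_signals = [1.02, 0.04, 0.96, 0.08, 0.03, 2.05, 0.10, 1.01]
called = call_bases(example_signals)
print(called)
```

A signal near 2 in an A flow is read as the homopolymer AA, which is why the linearity of signal with homopolymer length matters for accuracy.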

High-quality reads and consensus accuracy 

Before base calling or aligning reads, we select high-quality reads 
without relying on a priori knowledge of the genome or template 
being sequenced (Supplementary Methods). This selection is based 
on the observation that poor-quality reads have a high proportion of 
signals that do not allow a clear distinction between a flow during 
which no nucleotide was incorporated and a flow during which one 
or more nucleotides were incorporated. When base calling individual 
reads, errors can occur because of signals that have ambiguous values 
(Supplementary Fig. 5). To improve the usability of our reads, we also 
developed a metric that allows us to estimate ab initio the quality (or 
probability of correct base call) of each base of a read, analogous to 
the Phred score 17 used by current Sanger sequencers (Supplementary 
Methods and Supplementary Fig. 3). 
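
For reference, the conventional Phred scale relates a quality score Q to the estimated probability of a wrong base call p by Q = -10 log10(p). The short sketch below shows that conversion in both directions; it illustrates the standard definition only, not the quality metric developed for this instrument, and the function names are illustrative.

```python
import math


def phred_from_error_probability(p_error: float) -> float:
    """Phred quality score for a given probability of a wrong base call."""
    return -10.0 * math.log10(p_error)


def accuracy_from_phred(q: float) -> float:
    """Probability of a correct base call for a given Phred score."""
    return 1.0 - 10.0 ** (-q / 10.0)


q20_accuracy = accuracy_from_phred(20)
q_for_one_percent = phred_from_error_probability(0.01)

print(f"Phred 20 corresponds to an accuracy of {q20_accuracy:.2%}")
print(f"A 1% error probability corresponds to Phred {q_for_one_percent:.0f}")
```

This is the sense in which bases with a Phred score of 20 or better are predicted to have an accuracy of 99% or higher.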
Figure 3 | Flowgram of a 113-base read from an M. genitalium 
run. Nucleotides are flowed in the order T, A, C, G. The sequence is shown 
above the flowgram. The signal value intervals corresponding to the various 
homopolymers are indicated on the right. The first four bases (in red, above 
the flowgram) constitute the 'key' sequence, used to identify wells containing 
a DNA-carrying bead. 



Higher quality sequence can be achieved by taking advantage of 
the high oversampling that our system affords and building a 
consensus sequence. Sequences are aligned to one another using 
the signal strengths at each nucleotide flow, rather than individual 
base calls, to determine optimal alignment (Supplementary 
Methods). The corresponding signals are then averaged, after 
which base calling is performed. This approach greatly improves 
the accuracy of the sequence (Supplementary Fig. 7) and provides an 
estimate of the quality of the consensus base. We refer to that quality 
measure as the Z-score; it is a measure of the spread of signals in all 
the reads at one location and the distance between the average signal 
and the closest base-calling threshold value. In both re-sequencing 
and de novo sequencing, as the minimum Z-score is raised the 
consensus accuracy increases, while coverage decreases; 
approximately half of the excluded bases, as the Z-score is increased, belong 
to homopolymers of length four and larger. Sanger sequencers 
usually require a depth of coverage at any base of three or more in 
order to achieve a consensus accuracy of 99.99%. To achieve a 
minimum of threefold coverage of 95% of the unique portions of a 
typical genome requires approximately seven- to eightfold 
oversampling. Owing to our higher error rate, we have observed that 
comparable consensus accuracies, over a similar fraction of a 
genome, are achieved with a depth of coverage of four or more, requiring 
approximately ten- to twelvefold oversampling. 
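
The idea of averaging aligned flow signals before calling a base can be sketched as below. The exact definition of the Z-score is given in the Supplementary Methods and is not reproduced here; the score computed in this sketch is a hypothetical stand-in, assumed for illustration to be the distance from the average signal to the nearest calling threshold divided by the standard deviation of the signals. Thresholds are assumed to sit at half-integers, and the example signal values are invented.

```python
import math
from statistics import mean, pstdev


def consensus_call(signals):
    """Average aligned flow signals, call the homopolymer length, and return
    (called_length, average_signal, z_like_score).

    z_like_score is a hypothetical quality measure: distance from the
    average signal to the nearest half-integer threshold, divided by the
    population standard deviation of the signals.
    """
    average = mean(signals)
    called_length = max(0, math.floor(average + 0.5))
    upper_threshold = called_length + 0.5
    if called_length == 0:
        # No lower threshold exists below a call of zero incorporations.
        distance = upper_threshold - average
    else:
        lower_threshold = called_length - 0.5
        distance = min(average - lower_threshold, upper_threshold - average)
    spread = pstdev(signals)
    z_like = math.inf if spread == 0 else distance / spread
    return called_length, average, z_like


clear = consensus_call([2.1, 1.9, 2.0, 2.2, 1.8])
ambiguous = consensus_call([1.4, 1.6, 1.5, 1.3, 1.7])
print("clear position:    ", clear)
print("ambiguous position:", ambiguous)
```

A position whose average signal sits midway between two integers receives a score near zero and would be excluded first as the minimum score is raised, which mirrors the trade-off between consensus accuracy and coverage described above.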

Mycoplasma genitalium 

Mycoplasma genomic DNA was fragmented and prepared into a 
sequencing library as described above. (This was accomplished by a 
single individual in 4 h.) After emulsion polymerase chain reaction 
(PCR) and bead deposition onto a 60 × 60 mm² fibre-optic slide, a 
process which took one individual 6 h, 42 cycles of four nucleotides 
were flowed through the sequencing system in an automated 4-h run 
of the instrument. The results are summarized in Table 2. In order to 
measure the quality of individual reads, we aligned each high-quality 
read to the reference genome at 70% stringency using flow-space 
mapping and criteria similar to those used previously in assessing the 
accuracy of other base callers 17. When assessing sequencing quality, 
only reads that mapped to unique locations in the reference genome 
were included. Because this process excludes repeat regions (parts of 
the genome for which corresponding flowgrams are 70% similar to 
one another), the selected reads did not cover the genome 
completely. Figure 4a illustrates the distribution of read lengths for this run. 
The average read length was 110 bases, the resulting oversampling 
40-fold, and 84,011 reads (27.4%) were perfect. Figure 4b summarizes 
the average error as a function of base position. Coverage of 
non-repeat regions was consistent with the sample preparation and 
emulsion not being biased (Supplementary Fig. 8). At the individual 
read level, we observe an insertion and deletion error rate of 
approximately 3.3%; substitution errors have a much lower rate, 
on the order of 0.5%. When using these reads without any Z-score 
restriction, we covered 99.94% of the genome in ten contiguous 
regions with a consensus accuracy of 99.97%. The error rate in 
homopolymers is significantly reduced in the consensus sequence 
(Supplementary Fig. 7). Of the bases not covered by this consensus 



Table 1 | Summary of sequencing statistics for test fragments 

Size of fibre-optic slide                     60 × 60 mm² 
Run time/number of cycles                     243 min/42 
Test fragment reads                           497,893 
Average read length (bases)                   108 
Number of bases in test fragments             53,705,267 
Bases with a Phred score of 20 and above      47,191,792 
Individual read insertion error rate          0.44% 
Individual read deletion error rate           0.15% 
Individual read substitution error rate 
All errors                                    0.60% 
sequence (366 bases), all belonged to excluded repeat regions. Setting 
a minimum Z-score equal to 4, coverage was reduced to 98.1% of the 
genome, while consensus accuracy increased to 99.996%. We further 
demonstrated the reproducibility of the system by repeating the 
whole-genome sequencing of M. genitalium an additional eight 
times, achieving a 40-fold coverage of the genome in each of the 
eight separate instrument runs (Supplementary Table 3). 

We assembled the M. genitalium reads from a single run into 25 
contigs with an average length of 22.4 kb. One of these contigs was 
misassembled due to a collapsed tandem repeat region of 60 bases, 
and was corrected by hand. The original sequencing of M. genitalium 
resulted in 28 contigs before directed sequencing used for finishing 
the sequence 13. Our assembly covered 96.54% of the genome and 
attained a consensus accuracy of 99.96%. Non-resolvable repeat 
regions amount to 3% of the genome; we therefore covered 99.5% 
of the unique portions of the genome. Sixteen of the breaks between 
contigs were due to non-resolvable repeat regions, two were due to 
missed overlapping reads (our read filter and trimmer are not perfect 
and the algorithms we use to perform the pattern matching of 
flowgrams occasionally miss valid overlaps) and the remainder to 
thin read coverage. Setting a minimum Z-score of 4, coverage was 
reduced to 95.27% of the genome (98.2% of the resolvable part of the 
genome) with the consensus accuracy increasing to 99.994%. 
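
The step from whole-genome coverage to coverage of the unique (resolvable) portion follows from dividing by the resolvable fraction of the genome. The sketch below reproduces that arithmetic from the percentages quoted above, taking the non-resolvable repeat fraction as the stated 3%; the function name is illustrative.

```python
NON_RESOLVABLE_PERCENT = 3.0   # repeat regions, as a percentage of the genome


def coverage_of_resolvable_part(genome_coverage_percent: float) -> float:
    """Coverage expressed relative to the resolvable part of the genome."""
    resolvable_percent = 100.0 - NON_RESOLVABLE_PERCENT
    return 100.0 * genome_coverage_percent / resolvable_percent


all_bases = coverage_of_resolvable_part(96.54)
z_at_least_4 = coverage_of_resolvable_part(95.27)

print(f"No Z-score restriction: {all_bases:.1f}% of the resolvable genome")
print(f"Minimum Z-score of 4:   {z_at_least_4:.1f}% of the resolvable genome")
```

Both results round to the 99.5% and 98.2% figures given in the text.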

Discussion 

We have demonstrated the simultaneous acquisition of hundreds of 
thousands of sequence reads, 80-120 bases long, at 96% average 
accuracy in a single run of the instrument using a newly developed in 
vitro sample preparation methodology and sequencing technology. 
With Phred 20 as a cutoff, we show that our instrument is able to 
produce over 47 million bases from test fragments and 25 million 
bases from genomic libraries. We used test fragments to decouple 
our sample preparation methodology from our sequencing 
technology. The decrease in single-read accuracy from 99.4% for test 
fragments to 96% for genomic libraries is primarily due to a lack 
of clonality in a fraction of the genomic templates in the emulsion, 
and is not an inherent limitation of the sequencing technology. Most 
of the remaining errors result from a broadening of signal 
distributions, particularly for large homopolymers (seven or more), 
leading to ambiguous base calls. Recent work on the sequencing 



Table 2 | Summary statistics for M. genitalium 

Sequencing summary 
  Number of instrument runs                          1 
  Size of fibre-optic slide                          60 × 60 mm² 
  Run time/number of cycles                          243 min/42 
  High quality reads                                 306,178 
  Average read length (bases)                        110 
  Number of bases in high quality reads              33,655,353 
  Bases with a Phred score of 20 and above           26,753,540 
Re-sequencing 
  Reads mapped to single locations                   238,066 
  Number of bases in mapped reads                    27,637,747 
  Individual read insertion error rate               1.67% 
  Individual read deletion error rate                1.60% 
  Individual read substitution error rate 
Re-sequencing consensus 
  Average oversampling                               40× 
  Coverage, all (Z ≥ 4)                              99.9% (98.2%) 
  Consensus accuracy, all (Z ≥ 4)                    99.97% (99.996%) 
  Consensus insertion error rate, all (Z ≥ 4)        0.02% (0.003%) 
  Consensus deletion error rate, all (Z ≥ 4)         0.01% (0.002%) 
  Consensus substitution error rate, all (Z ≥ 4)     0.001% (0.0003%) 
  Number of contigs                                  10 
De novo assembly 
  Coverage, all (Z ≥ 4)                              96.54% (95.27%) 
  Consensus accuracy, all (Z ≥ 4)                    99.96% (99.994%) 
  Number of contigs                                  25 
  Average contig size (kb)                           22.4 



chemistry and algorithms that correct for crosstalk between wells 
suggests that the signal distributions will narrow, with an attendant 
reduction in errors and increase in read lengths. In preliminary 
experiments with genomic libraries that also include improvements 
in the emulsion protocol, we are able to achieve, using 84 cycles, read 
lengths of 200 bases with accuracies similar to those demonstrated 
here for 100 bases. On occasion, at 168 cycles, we have generated 
individual reads that are 100% accurate over greater than 400 bases. 

Using M. genitalium we demonstrate that short fragments a priori 
do not prohibit the de novo assembly of bacterial genomes. In fact, the 
larger oversampling afforded by the throughput of our system 
resulted in a draft sequence having fewer contigs than with Sanger 
reads, with substantially less effort. By taking advantage of the 
oversampling, consensus accuracies greater than 99.96% were achieved 
for this genome. Further quality filtering of the assembly results in 
the selection of a consensus sequence with accuracy exceeding 
99.99% while incurring only a minor loss of genome coverage. 
Comparable results were seen when we shotgun sequenced and de 
novo assembled the 2.1-Mb genome of Streptococcus pneumoniae 14 
(Supplementary Table 4). The de novo assembly of genomes more 
complex than bacteria, including mammalian genomes, may require 
the development of methods similar to those developed for Sanger 
sequencing, to prepare and sequence paired end libraries that can 
span repeats in these genomes. To facilitate the use of paired end 
libraries we have developed methods to sequence, in an individual 
well, from both ends of genomic template, and plan to add paired end 
read capabilities to our assembler (Supplementary Methods). 

Future increases in throughput, and a concomitant reduction in 
cost per base, may come from the continued miniaturization of the 
fibre-optic reactors, allowing more sequence to be produced per unit 
area, a scaling characteristic similar to that which enabled the 
The individual read error rates are referenced to the total number of bases in mapped reads. 

Figure 4 | M. genitalium data. a, Read length distribution for the 306,178 
high-quality reads of the M. genitalium sequencing run (x axis, read length 
(bp); average read length, 109.9 bp; standard deviation, 20.1). This distribution 
reflects the base composition of individual sequencing templates. b, Average 
read accuracy, at the single read level, as a function of base position for the 
238,066 mapped reads of the same run. 
prediction of significant improvements in the integrated circuit at the 
start of its development cycle. 

METHODS 

Emulsion-based clonal amplification. The simultaneous amplification of 
fragments is achieved by isolating individual DNA-carrying beads in separate 
~100-μm aqueous droplets (on the order of 2 × 10⁶ ml⁻¹) made through the creation 
of a PCR-reaction-mixture-in-oil emulsion (Fig. 1b; see also Supplementary 
Methods). The droplets act as separate microreactors in which parallel DNA 
amplifications are performed, yielding approximately 10⁷ copies of a template 
per bead; 800 μl of emulsion containing 1.5 million beads are prepared in a 
standard 2-ml tube. Each emulsion is aliquoted into eight PCR tubes for 
amplification. After PCR, the emulsion is broken to release the beads, which 
include beads with amplified, immobilized DNA template and empty beads 
(Supplementary Methods). We then enrich for template-carrying beads 
(Supplementary Methods). Typically, about 30% of the beads will have DNA, 
producing 450,000 template-carrying beads per emulsion reaction. The number 
of emulsions prepared depends on the size of the genome and the expected 
number of runs required to achieve adequate oversampling. The 580-kb 
M. genitalium genome, sequenced on one 60 × 60 mm² fibre-optic slide, required 
1.6 ml of emulsion. A human genome, oversampled ten times, would require 
approximately 3,000 ml of emulsion. 
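
The bead yield and emulsion volume quoted for one slide follow from simple arithmetic, sketched below. The bead count, template-carrying fraction and per-emulsion volume are taken from this section; the assumption that one slide uses two emulsion preparations (one per half of the slide) is drawn from the bead-loading description in the Methods, and the variable names are illustrative.

```python
BEADS_PER_EMULSION = 1_500_000
TEMPLATE_CARRYING_FRACTION = 0.30
EMULSION_VOLUME_ML = 0.8     # 800 microlitres per emulsion
SLIDE_HALVES = 2             # one emulsion preparation loaded per half slide

template_beads_per_emulsion = round(
    BEADS_PER_EMULSION * TEMPLATE_CARRYING_FRACTION
)
emulsion_volume_per_slide_ml = SLIDE_HALVES * EMULSION_VOLUME_ML
template_beads_per_slide = SLIDE_HALVES * template_beads_per_emulsion

print(f"Template-carrying beads per emulsion: {template_beads_per_emulsion:,}")
print(f"Emulsion volume per slide: {emulsion_volume_per_slide_ml:.1f} ml")
print(f"Template-carrying beads per slide: {template_beads_per_slide:,}")
```

This reproduces the 450,000 template-carrying beads per emulsion reaction and the 1.6 ml of emulsion used for the single-slide M. genitalium run.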

Bead loading into picolitre wells. The enriched template-carrying beads are 
deposited by centrifugation into open wells (Fig. 1c), arranged along one face of a 
60 × 60 mm² fibre-optic slide. The beads (diameter ~28 μm) are sized to ensure 
that no more than one bead fits in most wells (we observed that 2-5% of filled 
wells contain more than one bead). Loading 450,000 beads (from one emulsion 
preparation) onto each half of a 60 × 60 mm² plate was experimentally found to 
limit bead occupancy to approximately 35% of all wells, thereby reducing 
chemical and optical crosstalk between wells. A mixture of smaller beads that 
carry immobilized ATP sulphurylase and luciferase necessary to generate light 
from free pyrophosphate is also loaded into the wells to create the individual 
sequencing reactors (Supplementary Methods). 

Image capture. A bead carrying 10 million copies of a template yields 
approximately 10,000 photons at the CCD sensor, per incorporated nucleotide. 
The generated light is transmitted through the base of the fibre-optic slide and 
detected by a large format CCD (4,096 × 4,096 pixels). The images are processed 
to yield sequence information simultaneously for all wells containing 
template-carrying beads. The imaging system was designed to accommodate a large 
number of small wells and the large number of optical signals being generated 
from individual wells during each nucleotide flow. Once mounted, the 
fibre-optic slide's position does not shift; this makes it possible for the image analysis 
software to determine the location of each well (whether or not it contains a 
DNA-carrying bead), based on light generation during the flow of a 
pyrophosphate solution, which precedes each sequencing run. A single well is imaged by 
approximately nine 15-μm pixels. For each nucleotide flow, the light intensities 
collected by the pixels covering a particular well are summed to generate a signal 
for that particular well at that particular nucleotide flow. Each image captured by 
the CCD produces 32 megabytes of data. In order to perform all of the necessary 
signal processing in real time, the control computer is fitted with an accessory 
board (Supplementary Methods), hosting a 6 million gate Field Programmable 
Gate Array (FPGA) 33. 
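
Two of the statements above lend themselves to a small sketch: the size of one image, and the summation of pixel intensities over a well. The code assumes a 4,096 × 4,096 pixel sensor read out at 2 bytes (16 bits) per pixel, which is an assumption made here because it reproduces the stated 32 megabytes per image; the bit depth is not given in the text. The 3 × 3 pixel block stands in for the approximately nine pixels that image one well, and the tiny example image is invented.

```python
def frame_megabytes(width_px: int, height_px: int, bytes_per_pixel: int) -> float:
    """Size of one raw image in megabytes (1 MB taken as 2**20 bytes)."""
    return width_px * height_px * bytes_per_pixel / 2 ** 20


def well_signal(image, top: int, left: int, size: int = 3) -> int:
    """Sum the pixel intensities in a size x size block covering one well."""
    return sum(
        image[row][col]
        for row in range(top, top + size)
        for col in range(left, left + size)
    )


frame_size_mb = frame_megabytes(4096, 4096, 2)  # 2 bytes per pixel is assumed

# Hypothetical 4 x 4 image: a bright 3 x 3 well on a dark background.
example_image = [
    [10, 12, 11, 0],
    [13, 40, 12, 0],
    [11, 12, 10, 0],
    [0, 0, 0, 0],
]
signal = well_signal(example_image, top=0, left=0)

print(f"One frame: {frame_size_mb:.0f} MB")
print(f"Summed signal for the example well: {signal}")
```

Summing over the block rather than reading a single pixel uses all of the light collected from a well at a given nucleotide flow.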

De novo shotgun sequence assembler. A de novo flow-space assembler was 
developed to capture all of the information contained in the original flow-based 
signal trace. It also addresses the fact that existing assemblers are not optimized 
for 80-120-base reads, particularly with respect to memory management due to 
the increased number of sequencing reads needed to achieve equivalent genome 
coverage. (A completely random genome covered with 100-base reads requires 
approximately 50% more reads to yield the same number of contiguous regions 
(contigs) as achieved with 700-base reads, assuming the need for a 30-base 
overlap between reads 11.) This assembler consists of a series of modules: the 
Overlapper, which finds and creates overlaps between reads; the Unitigger, which 
constructs larger contigs of overlapping sequence reads; and the Multialigner, 
which generates consensus calls and quality scores for the bases within each 
contig (Supplementary Methods). (The names of the software modules are based 
on those performing related functions in other assemblers developed 
previously 33.) 
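
To make the role of an overlap-finding step concrete, the sketch below detects exact suffix-prefix overlaps between two reads in base space, with a minimum overlap length as a parameter. This is a simplified stand-in and not the Overlapper itself: the assembler described here works on flow-space signals and tolerates imperfect matches, whereas this example requires exact matches between base strings. The function name and the example reads are hypothetical.

```python
def suffix_prefix_overlap(read_a: str, read_b: str, min_overlap: int = 30) -> int:
    """Length of the longest suffix of read_a that exactly matches a prefix
    of read_b, or 0 if no such overlap of at least min_overlap bases exists.
    """
    longest_possible = min(len(read_a), len(read_b))
    # Try the longest candidate first so the first hit is the best overlap.
    for length in range(longest_possible, min_overlap - 1, -1):
        if length > 0 and read_a[-length:] == read_b[:length]:
            return length
    return 0


# Short hypothetical reads; a small minimum overlap is used for illustration.
first_read = "ACGTTGCA"
second_read = "TGCAGGTC"

short_threshold = suffix_prefix_overlap(first_read, second_read, min_overlap=4)
default_threshold = suffix_prefix_overlap(first_read, second_read)

print(f"Overlap with a 4-base minimum:  {short_threshold}")
print(f"Overlap with a 30-base minimum: {default_threshold}")
```

With a 30-base minimum overlap, a 100-base read contributes far fewer usable non-overlapping bases than a 700-base read, which is the intuition behind the need for more reads to reach the same number of contigs.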